In this chapter, you will learn:
The table below lists our training data: the diameters of several pizzas and their prices.
Training Instance | Diameter (inches) | Price (dollars) |
---|---|---|
1 | 6 | 7 |
2 | 8 | 9 |
3 | 10 | 13 |
4 | 14 | 17.5 |
5 | 18 | 18 |
In [51]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')
# X is the explanatory variable data structure
X = [[6], [8], [10], [14], [18]]
# Y is the response variable data structure
y = [[7], [9], [13], [17.5], [18]]
# instantiate a pyplot figure object
plt.figure()
plt.title('Figure 1. Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.show()
Based on the visualization above, we can see that there is a positive relationship between pizza diameter and price.
In [52]:
from sklearn.linear_model import LinearRegression
# Training Data
# X is the explanatory variable data structure
X = [[6], [8], [10], [14], [18]]
# Y is the response variable data structure
y = [[7], [9], [13], [17.5], [18]]
# Create the model
model = LinearRegression()
# Fit the model to the training data
model.fit(X, y)
# Make a prediction about how much a 12 inch pizza should cost
test_X = [[12]]
prediction = model.predict(test_X)
print 'A 12" pizza should cost: $%.2f' % prediction[0]
The sklearn.linear_model.LinearRegression class is an estimator. Given a new value of the explanatory variable, estimators predict a response value. All estimators have the fit() and predict() methods: fit() is used to learn the parameters of a model, while predict() predicts the value of the response variable given an explanatory variable value.
The mathematical specification of a simple regression model is the following:
$${y} = \alpha + \beta{x}$$
Where:
- $y$ is the predicted value of the response variable (the price of the pizza)
- $x$ is the explanatory variable (the diameter of the pizza)
- $\alpha$ is the intercept term
- $\beta$ is the coefficient of the explanatory variable
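Since the model has already been fitted above, we can read the learned estimates of $\alpha$ and $\beta$ straight off the estimator: scikit-learn exposes them as the intercept_ and coef_ attributes. The cell below is a minimal sketch; the indexing assumes y was passed as a column of single-element lists, as in the training cell above.
In [ ]:
# The fitted estimator stores alpha in intercept_ and beta in coef_.
# Indexing assumes y was a column (list of single-element lists).
print 'alpha (intercept): %.2f' % model.intercept_[0]
print 'beta (coefficient): %.2f' % model.coef_[0][0]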
In [56]:
# instantiate a pyplot figure object
plt.figure()
# re-plot a scatter plot
plt.title('Figure 2. Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
# create the line of fit
line_X = [[i] for i in np.arange(0, 25)]
line_y = model.predict(line_X)
plt.plot(line_X, line_y, '-b')
plt.show()
Training a model to learn the parameter values that make simple linear regression the best unbiased estimator is called ordinary least squares, or linear least squares. To get a better idea of what the "best unbiased estimator" is estimating in the first place, let's define what is needed to fit a model to training data.
How do we know whether the parameter values specified by a particular model are doing well or poorly? In other words, how can we assess which parameters produce the best-fitting regression line?
The cost function, also called the loss function, measures the error of a model. To find the best-fitting regression line, the goal is to minimize the sum of the differences between the predicted prices and the corresponding observed prices of the pizzas in the training set; these differences are known as residuals or training errors.
We can visualize the residuals by drawing a vertical line from each observed price to the corresponding predicted price. Fortunately, matplotlib provides the vlines() function, which takes x, ymin, and ymax arguments to draw a vertical line on a plot. We re-create Figure 2, but with the residuals this time.
In [58]:
# instantiate a pyplot figure object
plt.figure()
# re-plot a scatter plot
plt.title('Figure 3. Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
# create the line of fit
line_X = [[i] for i in np.arange(0, 25)]
line_y = model.predict(line_X)
plt.plot(line_X, line_y, '-b')
# create residual lines
for x_i, y_i in zip(X, y):
    plt.vlines(x_i[0], y_i[0], model.predict([x_i])[0], colors='r')
plt.show()
Now that we can clearly see the prediction errors (in red) made by our model (in blue), it's important to quantify the overall error through a formal definition of the residual sum of squares (RSS).
We do this by summing the squared residuals for all of our training examples (we square the residuals because we don't care whether the error is in the positive or negative direction).
$$RSS = \sum_{i=1}^n\big(y_{i} - f(x_{i})\big)^2$$
Where:
- $y_{i}$ is the observed value of the response variable for the $i$-th training instance
- $f(x_{i})$ is the predicted value of the response variable for the $i$-th training instance
- $n$ is the number of training instances
A related measure of model error is the mean squared error (MSE), which is simply the mean of the squared residuals:
$$MSE = \dfrac{1}{n}\sum_{i=1}^n\big(y_{i} - f(x_{i})\big)^2$$
Let's go ahead and implement RSS and MSE using numpy:
In [64]:
import numpy as np
# residual sum of squares: sum of the squared prediction errors
rss = np.sum((model.predict(X) - y) ** 2)
# mean squared error: mean of the squared prediction errors
mse = np.mean((model.predict(X) - y) ** 2)
print 'Residual sum of squares: %.2f' % rss
print 'Mean squared error: %.2f' % mse
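As a sanity check, scikit-learn also provides a mean_squared_error helper in sklearn.metrics; it should agree with the value computed by hand above:
In [ ]:
from sklearn.metrics import mean_squared_error
# Compare the hand-computed MSE with scikit-learn's implementation
print 'MSE via sklearn: %.2f' % mean_squared_error(y, model.predict(X))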
Now that we've defined the cost function, we can find the set of parameters that minimize the RSS or MSE.
Recall the equation for simple linear regression:
$$y = \alpha + \beta{x}$$
We need to solve for the values of $\beta$ and $\alpha$ that minimize the RSS cost function.
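For simple linear regression, minimizing RSS has a standard closed-form solution (stated here without derivation), which the next sections build up to using two summary statistics, variance and covariance:
$$\beta = \dfrac{cov(x, y)}{var(x)} \qquad \alpha = \bar{y} - \beta\bar{x}$$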
Variance is a summary statistic that represents how spread out a set of values is. Intuitively, the variance of set A = {0, 5, 10, 15, 20} is greater than the variance of set B = {5, 5, 5, 5, 5}. The formal definition of variance is:
$$var(x) = \dfrac{\sum_{i=1}^{n}\big(x_{i} - \bar{x}\big)^2}{n - 1}$$
Where:
- $x_{i}$ is the value of $x$ for the $i$-th training instance
- $\bar{x}$ is the mean of $x$
- $n$ is the number of training instances
Let's implement variance in Python.
In [72]:
from __future__ import division
# calculate the mean
n = len(X)
xbar = sum([x[0] for x in X]) / n
# calculate the variance
variance = sum([(x[0] - xbar) ** 2 for x in X]) / (n - 1)
print 'Variance: %.2f' % variance
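NumPy also provides np.var; passing ddof=1 uses the same n - 1 denominator as the formula above, so it should reproduce the value we just computed:
In [ ]:
# Cross-check: np.var with ddof=1 uses the n - 1 (sample) denominator
print 'Variance (numpy): %.2f' % np.var([x[0] for x in X], ddof=1)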
Covariance is a summary statistic that represents how two variables tend to change together. Suppose you have 3 sets:
We can say that cov(X, Y) is positive and cov(X, Z) is negative. If there is no linear relationship between two variables, then their covariance will equal zero. For example, if we have a fourth set of values:
Then cov(X, W) would be zero.
Let's do a sanity check on this intuition by implementing the formal definition of covariance:
$$ cov(x,y) = \dfrac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{n - 1} $$
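Below is a minimal sketch of this formula for the pizza data, reusing the n, xbar, and variance variables from the variance cell above (an assumption: those names must still be in scope). It cross-checks the result against np.cov, whose off-diagonal entry is the sample covariance, and then solves for $\beta$ and $\alpha$ using the closed form given earlier; the results should agree with the estimator's coef_ and intercept_.
In [ ]:
# assumes n, xbar, and variance are still defined from the previous cell
# calculate the mean of the response values
ybar = sum([y_i[0] for y_i in y]) / n
# sample covariance of diameter and price
covariance = sum([(x_i[0] - xbar) * (y_i[0] - ybar)
                  for x_i, y_i in zip(X, y)]) / (n - 1)
print 'Covariance: %.2f' % covariance
# cross-check: the off-diagonal entry of NumPy's covariance matrix
print 'Covariance (numpy): %.2f' % np.cov([x[0] for x in X],
                                          [y_i[0] for y_i in y])[0][1]
# solve for the model parameters using the closed-form solution
beta = covariance / variance
alpha = ybar - beta * xbar
print 'beta: %.2f, alpha: %.2f' % (beta, alpha)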